CONSTRUCTION OF DECISION TREES
Edwin Roger Banks 



The construction of optimal decision trees for the problem stated
within can be accomplished by an exhaustive enumeration. This paper
discusses two approaches. The section on heuristic methods gives
mostly negative results (e.g., there is no merit factor that will
always yield the optimal test), but most of these methods do
give good results. The section entitled "Exhaustive Enumeration
Revisited" indicates some powerful shortcuts that can be applied to
an exhaustive enumeration, extending the range of this method.




INTRODUCTION 

A. The Problem

This paper considers the optimal procedure for determining
whether a network of switches is open or closed. Each switch i
has an a priori probability p_i of being closed and an associated
cost C_i to determine the condition of the switch. The problem can
also be expressed in Polish notation as, for example:

    (AND (OR (AND T_1 T_2) T_3) (OR T_4 T_5))

where the T_i are tests with true or false outcomes and associated
p_i (of being true) and C_i. A third formulation of the problem will
be used in this paper: the problem tree for the above Polish form
is shown with the network representation:

[Figure: NETWORK (Closed?) and PROBLEM TREE (a. long form, b. short form)]
Bridge networks will be disallowed. 

Our goal is to obtain the decision tree of minimum expected cost.
A typical decision tree for the above problem tree is shown:





[Figure: DECISION TREE. The legend marks one branch direction as an
open circuit and the other as a closed circuit.]



The expected cost of this decision tree is:

    C_3 +
    p_3 (C_4 + q_4 C_5) +
    q_3 (C_1 + p_1 (C_2 + p_2 (C_4 + q_4 C_5)))

where q = 1 - p is the a priori probability of being open. The above
formula was obtained from the tree as follows. C_3 is the first test;
it is made unconditionally. If the test results in switch 3 being
closed, then the parenthesized part of the second line will become
the expected cost of determining the remaining part of the circuit.
And switch 3 is closed with probability p_3.
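The same reading of a decision tree can be mechanized. The following is a minimal sketch, not part of the original paper: the tuple encoding of the tree, the function names, and the numerical values of p and C are all illustrative assumptions. It evaluates the tree recursively and compares the result with the formula written out above.

```python
from fractions import Fraction

# A node is (switch, subtree_if_open, subtree_if_closed).
# None marks a leaf: the state of the circuit is already known there.
def expected_cost(tree, p, C):
    """Expected cost of running the decision tree from this node."""
    if tree is None:
        return 0
    switch, if_open, if_closed = tree
    q = 1 - p[switch]
    return (C[switch]
            + p[switch] * expected_cost(if_closed, p, C)
            + q * expected_cost(if_open, p, C))

# Tree behind the formula: test 3 first; tests 4 and 5 settle the rest.
tail = (4, (5, None, None), None)        # test 4; test 5 only if 4 is open
tree = (3,
        (1, None, (2, None, tail)),      # 3 open: test 1, then 2, then tail
        tail)                            # 3 closed: go straight to tail

# Illustrative values only (not taken from the paper).
p = {i: Fraction(1, 2) for i in range(1, 6)}
C = {1: 4, 2: 1, 3: 2, 4: 3, 5: 5}

def formula(p, C):
    """The expected-cost expression exactly as written in the text."""
    q = {i: 1 - p[i] for i in p}
    return (C[3]
            + p[3] * (C[4] + q[4] * C[5])
            + q[3] * (C[1] + p[1] * (C[2] + p[2] * (C[4] + q[4] * C[5]))))

print(expected_cost(tree, p, C), formula(p, C))
```

Exact fractions are used so that the recursive evaluation and the closed formula can be compared for equality rather than to within rounding.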

Consideration of this problem probably originated (see Berlekamp)
in an attempt to optimize telephone switching circuits.* Another area
is problem solving where the solution to a problem can involve solving
sub-problems. In our problem tree, the AND represents a division
of the problem into sub-problems which jointly must be solved, and the
OR, those for which the solution of any sub-problem solves the problem.



*In this problem the cost is time.



Let n be the number of switches, and X_n the number of possible
decision trees. To be considered in X_n we require only that the tree
has no repeated tests. (Otherwise X_n = ∞.) Then X_n becomes:

    X_n = n (X_{n-1} + 1)^2   and   X_1 = 1

or

    X_n = 1, 8, 243, 238144, about 2.8 x 10^11, ...  for n = 1, 2, 3, 4, 5, ...
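The recurrence is easy to iterate directly. A short sketch follows; the function name is an illustrative choice, and the squared form of the recurrence is inferred from the listed values 1, 8, 243, 238144, which it reproduces exactly.

```python
def count_trees(n):
    """X_n = n * (X_{n-1} + 1)**2 with X_1 = 1.

    The root is one of n switches; each of its two branches is either
    empty or one of the X_{n-1} trees on the remaining switches.
    """
    x = 1
    for k in range(2, n + 1):
        x = k * (x + 1) ** 2
    return x

print([count_trees(n) for n in range(1, 6)])
```

For n = 5 the recurrence gives 283565205125, so the count grows by roughly six orders of magnitude between four and five switches.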

This formula can be criticized because it counts incomplete decision
trees, but if X_n is to rule out these trees, then it becomes a function
of how the n switches are interconnected. In any event, a procedure
which attempts to find optimal trees by exhaustive enumeration and
comparison will be practically limited to n = 4 or less.

Interestingly, out of the 238000 trees for the 4-switch network,
for example, there are exactly eight which qualify as a possibly
optimal decision tree, regardless of the values of the C_i, the p_i,
or how connected! Results of this sort are reported in the section
titled "EXHAUSTIVE ENUMERATION REVISITED".

The above results were discovered after an initial effort to
apply heuristics to the problem. Several interesting theorems,
including the failure theorem which shows that counter-examples exist
for a broad class of heuristics, are reported in the next section.

The last section considers the technique of test-at-a-time, i.e.,
instead of constructing a good decision tree which is simply read in
order to choose a test, how can a good test be chosen without the
tree? The merit of this technique is that the decision tree is
typically very large, requiring storage space. (If a bushy problem
tree is assumed, the average path length of the decision tree will
be n/2, giving a total size of about 2^(n/2).)



HEURISTIC TECHNIQUES

The first class of heuristics involves the merit-factor approach.
A merit factor is calculated for each switch and the largest (or
smallest) merit factor determines the test. For each possible
outcome of the test (open or closed) the circuit diagram is simplified
and new merit factors are computed. The merit factor may be a function
of the costs or probabilities of any or all of the switches and of
the structure of the network itself, and other factors. We will
design a few merit factors and analyze them.

A useful quantity to include in a merit factor is the a priori
probability that the entire circuit is closed. Let P designate this
quantity, which is easily evaluated from the p_i and the problem tree.
We will also use Q = 1 - P as the total probability of an open circuit.
Two other useful quantities are ΔP_i and ΔQ_i. ΔP_i will represent
the increase which results in P if switch i is closed. Similarly,
ΔQ_i will represent the increase in Q for switch i open.

As tests are made, a plot of P against cost can be drawn.

[Figure: P as a Function of Tests Drawn for a Particular Decision Tree.
The goal is the top or bottom line.]

This diagram suggested most of the following merit factors. 

1. Heuristic: make that test which gives the greatest expected
change in P per unit cost. Thus we define

    F_1 = (p ΔP + q ΔQ) / C

2. Heuristic: make that test which gives the greatest percent
change in P per unit cost:

    F_2 = (p ΔP / P + q ΔQ / Q) / C

3. Heuristic: weight the factors of F_1 by P and Q. Thus we
multiply the expected increase by P, which is the probability
that an increase is desired:

    F_3 = (p P ΔP + q Q ΔQ) / C

4. Heuristic: make the test which yields the largest expected
percent reduction in distance remaining to the goal:

    F_4 = (p ΔP / Q + q ΔQ / P) / C

THEOREM I. The above heuristics are equivalent (within constants)
to each other and to the simple merit factor F:

    F = p ΔP / C

To prove THEOREM I we need LEMMA I.

LEMMA I. P is individually linear in p_i. That is, for functions G
and H that do not depend on p_i,

    P = G + H p_i

Inductive proof of LEMMA I.

Any network (including bridge type networks) can be built
by starting with one switch and two nodes and adding switches and
nodes by use of two methods:

I. By connecting two existing nodes by a switch.
II. By adding a switch and a node between a switch and a node.

[Figure: I. CONNECTING TWO NODES BY A SWITCH; II. INSERTING A SWITCH
AND NODE BETWEEN A SWITCH AND NODE]

The lemma is obviously true for a circuit with only one switch and
two nodes. Consider case I first. Let us add switch i. P' is
the new probability and P is individually linear by the induction
hypothesis.

    P' = p_i P'' + (1 - p_i) P

P'' is a network with one less node and is therefore satisfactory.
Obviously P' is individually linear. Considering case II as we add
switch i:

    P' = p_i P + (1 - p_i) P''

again still individually linear in p_i. Q.E.D.



Now we can prove THEOREM I. First we show the interesting
equivalence

    p_i ΔP_i = q_i ΔQ_i   for all i.

Since P is individually linear in p_i we can write

    P = G + H p_i

where G and H are functions of the probabilities of the other
switches. ΔP_i is the increase in P if p_i = 1, or

    ΔP_i = (G + H·1) - (G + H p_i) = H (1 - p_i) = H q_i

And since

    G + H p_i = G + H (1 - q_i) = (G + H) - H q_i

and

    Q = 1 - P = (1 - G - H) + H q_i

we can show in a similar manner as above for ΔP_i that

    ΔQ_i = H p_i

Therefore

    p_i ΔP_i = q_i ΔQ_i = H p_i q_i

Let us use X as a shorthand for pΔP to prove the equivalence of
our merit factors.

    F_1 = (pΔP + qΔQ) / C = (X + X) / C = 2 X / C

    F_2 = (X/P + X/Q) / C = (1/PQ) X / C    since P + Q = 1

and P and Q are initially the same for all tests.

    F_3 = (PX + QX) / C = X / C

    F_4 = (X/Q + X/P) / C = (1/PQ) X / C

Also any monotonically increasing functional combination of F_1 ... F_4
will be equivalent to F, in the sense that it will yield the same
test and hence the same decision tree.
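The equivalence can also be checked numerically. The sketch below is an illustration under stated assumptions: it uses the five-switch example network from the introduction, encodes it as nested tuples, and picks arbitrary p and C values (none of these choices come from the paper). It confirms that pΔP = qΔQ for every switch and that F and F_1 through F_4 rank the switches identically.

```python
from math import prod

# Example network from the introduction, as a nested AND/OR expression.
NET = ("AND", ("OR", ("AND", 1, 2), 3), ("OR", 4, 5))

def prob_closed(net, p):
    """P(circuit closed) for a series-parallel network of independent switches."""
    if isinstance(net, int):
        return p[net]
    op, *parts = net
    vals = [prob_closed(x, p) for x in parts]
    if op == "AND":                        # series: every part must be closed
        return prod(vals)
    return 1 - prod(1 - v for v in vals)   # parallel: at least one part closed

def merit_factors(net, p, C, i):
    """Return (p*dP, q*dQ, [F, F1, F2, F3, F4]) for switch i."""
    P = prob_closed(net, p)
    Q = 1 - P
    dP = prob_closed(net, {**p, i: 1.0}) - P   # gain in P if i turns out closed
    dQ = P - prob_closed(net, {**p, i: 0.0})   # gain in Q if i turns out open
    pi, qi, c = p[i], 1 - p[i], C[i]
    factors = [pi * dP / c,
               (pi * dP + qi * dQ) / c,
               (pi * dP / P + qi * dQ / Q) / c,
               (pi * P * dP + qi * Q * dQ) / c,
               (pi * dP / Q + qi * dQ / P) / c]
    return pi * dP, qi * dQ, factors

# Illustrative values only.
p = {1: 0.3, 2: 0.6, 3: 0.5, 4: 0.7, 5: 0.2}
C = {1: 4.0, 2: 1.0, 3: 2.0, 4: 3.0, 5: 5.0}

def ranking(k):
    """Switches ordered by the k-th merit factor, largest first."""
    return sorted(p, key=lambda i: merit_factors(NET, p, C, i)[2][k], reverse=True)

rankings = [ranking(k) for k in range(5)]
print(rankings[0], all(r == rankings[0] for r in rankings))
```

With these values every factor orders the switches 3, 4, 2, 1, 5, so each heuristic would choose switch 3 as the first test.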



One might expect that the time to find the largest pΔP / C
would be at least proportional to the number of switches. Thus to
construct a decision tree would require a time proportional to n
times the number of tests in the entire decision tree. Actually
a time proportional to log n is all that is needed after the first
test is chosen, if all the previous results are saved in tree form.

We have used 2^(n/2) as the expected size of the decision tree.
(Total size of a binary tree is twice the number of end points.)
The following intuitively obvious THEOREM suggests why the decision
tree should be so large.

THEOREM II. There is at least one path of length n in every (complete)
decision tree, i.e., it is always possible that when using the
decision tree, the state of every switch would have to be determined
to determine the state of the circuit (open or closed).

Inductive Proof of THEOREM II.

The theorem is obviously true for one switch. There are two
cases: either a switch is connected in parallel or in series. If
in parallel and if open, then there results a network with one
fewer switch and THEOREM II applies to this. A parallel argument
(pardon the pun) applies to the series connection. Thus an average
depth of about n/2 results. It should be noted that for any
technique yielding optimal trees, the time required is probably at least
proportional to the size of the tree it is building, giving 2^(n/2)
as a lower bound. (It is realized that perhaps the tree contains
many duplicate sub-structures, in which case a time less than the above
may exist.)



FAILURE THEOREM 



THEOREM III. There does not exist a merit factor F_i such that the
optimal decision tree is always found by picking that switch with
maximum (or minimum) F_i. F_i can be a function of p_i, C_i, the problem
tree structure, the other p_j (j ≠ i), indeed of anything except the
other switches' costs. F_i is restricted only to be independent of
C_j for j ≠ i.

Proof.

Consider the following network:

[Figure: network of switches 1, 2, 3; all p_i = 1/2]

Also consider the following assignments of costs to the switches
with optimal decision trees shown below.

    First case      Second case
    C_1 = 4         C_1 = 4
    C_2 = 1         C_2 = 1
    C_3 = 1.99      C_3 = 2.01

[Figure: the optimal decision tree for each cost assignment]

In the first case F_2 > F_1 while in the second case F_1 > F_2. But
the only change was in C_3, and F_1 and F_2 were not allowed to depend
on C_3.
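The flip in the best first test can be reproduced by a small exhaustive search. This sketch rests on an assumption that should be stated loudly: the network figure did not survive, so the code assumes switch 1 in series with the parallel pair 2, 3 (with all p = 1/2 the dual network gives the same costs). The tuple encoding and the function names are also illustrative choices.

```python
from functools import lru_cache

CLOSED, OPEN = "closed", "open"

def restrict(net, i, state):
    """Fix switch i to `state` and simplify the series-parallel network."""
    if isinstance(net, int):
        return state if net == i else net
    op, *parts = net
    settles = CLOSED if op == "OR" else OPEN   # outcome that decides this node
    kept = []
    for part in parts:
        r = restrict(part, i, state)
        if r == settles:
            return settles
        if r not in (CLOSED, OPEN):
            kept.append(r)                     # part is still undecided
    if not kept:
        return OPEN if op == "OR" else CLOSED
    return kept[0] if len(kept) == 1 else (op, *kept)

def switches(net):
    """Set of switch numbers still present in the network."""
    if isinstance(net, int):
        return {net}
    return set().union(*(switches(part) for part in net[1:]))

def best(net, p, C):
    """Return (minimum expected cost, best first switch) by exhaustive search."""
    @lru_cache(maxsize=None)
    def solve(state):
        if state in (CLOSED, OPEN):
            return 0.0, None
        return min(
            (C[i] + p[i] * solve(restrict(state, i, CLOSED))[0]
                  + (1 - p[i]) * solve(restrict(state, i, OPEN))[0], i)
            for i in switches(state))
    return solve(net)

# ASSUMED network (the figure is lost): 1 in series with (2 parallel 3).
NET = ("AND", 1, ("OR", 2, 3))
p = {1: 0.5, 2: 0.5, 3: 0.5}

first_case = best(NET, p, {1: 4, 2: 1, 3: 1.99})
second_case = best(NET, p, {1: 4, 2: 1, 3: 2.01})
print(first_case, second_case)
```

Under this assumed network the optimal first test is switch 2 when C_3 = 1.99 and switch 1 when C_3 = 2.01, with the crossover at C_3 = 2, which matches the behaviour the proof relies on.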



Berlekamp gave the following interesting example whose optimal
decision tree is shown. Again the numbers are costs; all p_i = 1/2.

[Figure: example network and its optimal decision tree; the costs
legible in the figure include 10, 100, 1000, and 10,000]

However, the following two trees must be equivalent where W, X, Y, and
Z are substructures.

[Figure: EQUIVALENT TREES]

The interesting fact is that the costs 1 and 1' in the example
can be varied independently over a small range while both forms of
the decision tree remain equivalent.



Winston has devised some heuristics for building trees that
add tests and then do local improvements which can move the new tests
to a position of lesser depth in the tree. His decision trees are for
a more general problem in which the tests are not simple true or false,
but divide a set of objects into two classes.

Reinwald and Soland present an algorithm for finding the optimal
tree, but Winston claims that it is only slightly better than exhaustive
enumeration.



BERLEKAMP'S RESULTS

Berlekamp defined a couple of merit factors but used them in
a different way from our usage. He defined a parallel merit factor
PMF = p_i / C_i and a series merit factor SMF = q_i / C_i and showed that
for a pure parallel or pure series network, PMF and SMF (when
evaluated and ordered for all switches) would determine the optimal tree.
In fact, extending this to a parallel-series (i.e. a parallel
connection of pure series circuits) or series-parallel (beads on a
string) network, he found the following algorithm would give the
optimal tree. We illustrate here only the series-parallel case.



[Figure: Example of Series-Parallel network.]

I. Replace each bead by a single switch whose cost is the expected
cost of the bead and whose probability is the probability of the
entire bead.

II. Calculate SMF to pick the bead for the first test.

III. Within the bead, calculate PMF to determine the switch. This
switch is the first test.

IV. Simplify the network (for both cases, open and closed switch) and
start over.
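The pure parallel and pure series results that this algorithm builds on can be checked by brute force on a small case. The sketch below does not implement the bead algorithm itself; it only verifies the two orderings. The switch labels, probabilities, and costs are illustrative assumptions, and SMF is taken to be q/C.

```python
from itertools import permutations

def chain_cost(order, stop, C):
    """Expected cost of testing switches in `order`.

    stop[i] is the probability that testing switch i settles the network:
    p_i for a pure parallel network (a closed switch closes the circuit),
    q_i for a pure series network (an open switch opens the circuit).
    """
    cost, still_testing = 0.0, 1.0
    for i in order:
        cost += still_testing * C[i]
        still_testing *= 1 - stop[i]
    return cost

# Illustrative values only.
p = {"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.6}
q = {i: 1 - p[i] for i in p}
C = {"a": 6.0, "b": 1.0, "c": 2.0, "d": 3.0}

# Pure parallel network: order by PMF = p / C, largest first.
pmf_order = tuple(sorted(p, key=lambda i: p[i] / C[i], reverse=True))
best_parallel = min(permutations(p), key=lambda o: chain_cost(o, p, C))

# Pure series network: order by SMF = q / C, largest first.
smf_order = tuple(sorted(q, key=lambda i: q[i] / C[i], reverse=True))
best_series = min(permutations(q), key=lambda o: chain_cost(o, q, C))

print(pmf_order, best_parallel, chain_cost(pmf_order, p, C))
print(smf_order, best_series)
```

In both cases the ordering by merit factor coincides with the cheapest of the 24 possible orders, which is the behaviour steps II and III rely on once each bead has been collapsed to a single equivalent switch.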

Although the method always yields the optimal tree for problems
whose problem tree has a depth of two, unfortunately attempts to
generalize to higher depths fail. Berlekamp gave the following
counter-example (depth of 3) where the numbers are the costs of the tests.



[Figure: COUNTER-EXAMPLE network, with the test costs as labels and
all p = 1/2, and its OPTIMAL DECISION TREE]



Although the optimal tree above has an expected cost of 206.5, the
application of Berlekamp's method yields a tree with expected cost
of 230.25. In fact Berlekamp found a formula for how much the method
costs.

BERLEKAMP'S THEOREM IV. Let T_c be the cost of the best strategy in
which parallel branches a and b are looked at consecutively and T_opt
the cost of the optimal strategy. Then

[Equation missing in the source: a relation between T_c and T_opt
written as a product of three factors]

(assuming p_a/C_a ≥ p_b/C_b). The first factor is the difference in
the merit factors; the second is the cost of the first branch, and
the third is the cost of the equivalent combination of the two branches.

COROLLARY. If the merit factors are equal, the branches may be
combined with no loss in expected cost.

However, Berlekamp did show the following useful theorem.

BERLEKAMP'S THEOREM III. If the optimal strategy starts in a particular
bead B, it will start at that branch of B which has the highest
parallel merit factor.

Attempts to generalize this theorem also failed, as shown in the
next theorem.

BERLEKAMP'S FALSE CONJECTURE IV. If the optimal strategy starts in
a substructure N, it will start at the same place that it would have
started in N alone. (Not true.)



An attempt to get the effect of the entire structure into the
problem involved substituting electrical resistors for the switches.
This type of operation has been known to work for some other problems.
We would like to have the resistance be a function of the probability
such that the best switch is found by taking that switch which has
i) maximum current, ii) maximum voltage drop, or iii) maximum power
dissipation per unit cost, assuming a one volt applied voltage.
If such a method is to work for our problem, it should satisfy the
following requirements:

I. For p = 0, R = ∞ and for p = 1, R = 0.

II. Resistors should combine into equivalent resistors by the
parallel and series combination laws.

Unfortunately the combination laws specify the form of the equation
and I. specifies the initial conditions, giving two different equations
depending on whether R is in series or parallel.




[Figure: the two curves, plotted against Resistance, R]

The fact that the two curves are so similar suggests that some
combination of the two forms of R should give good results.



EXHAUSTIVE ENUMERATION REVISITED

In the set of all possible decision trees for a large number of
switches, n, only a very small fraction of these are possible as
optimal trees. Some of these have identical costs.

The following results assume that the problem tree is slightly
modified to contain the switches ordered by their (Berlekamp) merit
factors at any particular level, as shown in the example.
[Figure: PROBLEM TREE (depth = 3) with the switches ordered so that
PMF_b ≥ PMF_c and SMF_d ≥ SMF_e ≥ SMF_f]



Berlekamp's algorithm can be used on any problem whose depth is two 
or less. To help eliminate non-optimal trees, THEOREM IV can be used. 

THEOREM IV. At any node in a decision tree, the maximum number of
remaining tests assuming the outcome is a closed switch and the
maximum number for an outcome of open switch are different numbers.

Proof.

In the proof of THEOREM II it was noted that one outcome gave
a subproblem of exactly one less switch. But the other case must have
cut off part of the circuit, giving a still smaller network. For instance,
if the switch was connected in series and was open, then at least two
fewer switches result in the subproblem.

This theorem immediately eliminates all trees that end in either
of the following structures.

[Figure: the two eliminated terminal structures]

THEOREM V. No structure of the following form can exist in an optimal
decision tree.

[Figure: the excluded structure]

This theorem states that any two switches a and b cannot appear at
the same level as a parallel connection in one case and series in
another.

Argument.

In the problem tree, a and b have a youngest common ancestor
which is at the parallel or series level, but whichever, it cannot
change. (This theorem does not apply to bridge-type networks.)

THEOREM VI. In the 3 switch network composed of switch a in series
with parallel b and c, if the SMF for a is greater than the maximum
SMF for b and c, then the tests will be applied in order of SMF.
Similar results hold for the dual case.



The simplest network for which Berlekamp's method fails (i.e.
depth greater than two) has four switches. This is the only 4 switch
network for depth of three.

[Figure: NETWORK and its PROBLEM TREE; DUAL NETWORK and its PROBLEM TREE]

Since the dual network is solved exactly the same as the original
network, giving identical decision trees, only one of each dual pair
is to be considered.



The formula for X_n (the number of decision trees) gives about
238,000 trees for the 4 switch problem. However, no more than the
following eight trees need be considered. All the rest must be
non-optimal.

[Figure: the eight candidate decision trees]

How did these eight arise? First, BERLEKAMP'S THEOREM III plus the
fact that the switches are ordered by their merit factors before we
start eliminates switch 4 from consideration immediately. Now consider
proving, for example, that the following tree is non-optimal. We
will generate a sequence of equivalent trees the last of which is
more costly than one of the above eight.

[Figure: a sequence of equivalent trees, the last of which is more
costly than one of the eight]






The above eight trees can be lumped into three classes where
within each circle the depth is two or less and Berlekamp's method
can be used to find the optimal tree.

[Figure: the three classes of trees]





Only these three cases now need be considered. (We are now assuming
that if the depth of the problem tree is less than three, Berlekamp's
method is applied; but if the depth is three, then these three trees
are considered.)

Going to five switches there are four meaningful problem trees
of depth three and one of depth four.




[Figure: FIVE SWITCH DEPTH = 3 TREES; FIVE SWITCH DEPTH = 4 TREE]

Depending on which of the above problem trees is being considered,
there are only 7, 5, 5, 5, or 8 trials to be made, respectively
(see Appendix I). For n = 5, X_n = X_5 = 2.8 x 10^11 approximately.

Optimal trees were found for each of the listed 4 switch trees, 
but perhaps some of the 5 switch trees can be eliminated by 
further study* 

Considering six switches, there are at most five ways to pick
the first one since at least one way violates the theorem of Berlekamp.
By THEOREM V one of the sub-problems (after the first one is made)
has five switches and the other has four or less. Thus the greatest
time required is proportional to 5(8+3) = 55.



To further speed these operations, some other techniques could
have been used. For example, we have seen that some decision trees
give identical costs and only one member of each family should be
considered. Another area for improvement concerns avoiding duplication
of effort on those sub-structures which appear many times within a
larger tree. Furthermore the techniques of this chapter have completely
abandoned consideration of the values of the p_i and C_i of the switches.
Some function of these values could probably divide the exhaustive
enumeration into binary halves, greatly speeding the calculations.
It is very probable that these and other pruning techniques could be
developed to the point that fairly large networks (say 20 switches)
could be handled in a reasonable amount of time. But if we remember
that the size of a typical decision tree itself grows as 2^(n/2), we
might find that about 20 to 30 would represent an upper bound on all
techniques.



TEST AT A TIME METHOD 

Since for large values of n the size of the decision tree becomes
unreasonably large, it becomes desirable to have a procedure which
determines the switches as the problem develops. Since the pΔP/C
method could be used (time to find the maximum value of this merit
factor was proportional to log n) in a look-ahead scheme as a static
evaluation function, tests could be determined very rapidly.
Unfortunately for the methods of the preceding section, it appears that
the entire tree would have to be determined to find the first test.
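A test-at-a-time procedure of this kind might look like the following sketch. It applies the pΔP/C merit factor directly, with no look-ahead, which is a simplification of the scheme described above. The network encoding, the function names, the p and C values, and the assumed true switch states are all illustrative.

```python
from math import prod

CLOSED, OPEN = "closed", "open"

def prob_closed(net, p):
    """P(circuit closed) for a series-parallel network of independent switches."""
    if isinstance(net, int):
        return p[net]
    vals = [prob_closed(x, p) for x in net[1:]]
    return prod(vals) if net[0] == "AND" else 1 - prod(1 - v for v in vals)

def restrict(net, i, state):
    """Fix switch i to `state` and simplify the network."""
    if isinstance(net, int):
        return state if net == i else net
    op, *parts = net
    settles = CLOSED if op == "OR" else OPEN
    kept = []
    for part in parts:
        r = restrict(part, i, state)
        if r == settles:
            return settles
        if r not in (CLOSED, OPEN):
            kept.append(r)
    if not kept:
        return OPEN if op == "OR" else CLOSED
    return kept[0] if len(kept) == 1 else (op, *kept)

def switches(net):
    if isinstance(net, int):
        return {net}
    return set().union(*(switches(part) for part in net[1:]))

def next_test(net, p, C):
    """Switch with the largest p * deltaP / C on the current network."""
    P = prob_closed(net, p)
    def merit(i):
        return p[i] * (prob_closed(net, {**p, i: 1.0}) - P) / C[i]
    return max(sorted(switches(net)), key=merit)

def run(net, p, C, actual):
    """Test one switch at a time until the state of the circuit is known."""
    tested, spent = [], 0.0
    while net not in (CLOSED, OPEN):
        i = next_test(net, p, C)
        tested.append(i)
        spent += C[i]
        net = restrict(net, i, CLOSED if actual[i] else OPEN)
    return net, tested, spent

NET = ("AND", ("OR", ("AND", 1, 2), 3), ("OR", 4, 5))
p = {i: 0.5 for i in range(1, 6)}                            # illustrative
C = {1: 4.0, 2: 1.0, 3: 2.0, 4: 3.0, 5: 5.0}                 # illustrative
actual = {1: True, 2: False, 3: True, 4: False, 5: True}     # hidden truth

print(run(NET, p, C, actual))
```

Only the path actually followed is ever computed, so no decision tree has to be stored; the price is that nothing guarantees the chosen tests are the ones an optimal tree would make.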
APPENDIX I

The five-switch problem: depth = 3.